{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Motif and Transcription Factor enrichment after Mowgli integration"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook sums up guidelines to: \n",
    "\n",
    "* extract the top features (e.g., genes or peaks) for all dimensions in the Mowgli embedding.\n",
    "* perform a motif enrichment analysis (using peaks) (as done for Figure 5 of our [manuscript](https://www.nature.com/articles/s41467-023-43019-2)).\n",
    "* perform a TF enrichment analysis (using genes) using a collection of TF->Targets.\n",
    "\n",
    "**NOTE #1:** This notebook uses both R and Python code. We recommend to copy paste in your local Jupyter or Rstudio session and run the code in the corresponding language. \n",
    "\n",
    "**NOTE #2:** The enrichments are performed on human data. Change this code and the databases accordingly if you are working with other species. \n",
    "\n",
    "**NOTE #3:** It is possible also to match the TF and motif enrichment to get a better viee of the relationship between the transcriptional and epigenetic landscape. We do not cover here this analysis since it's very case-specific  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Extract top features of a modality from Mowgli"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We assume that the results of Mowgli integration are in a `mdata` object that store genes in the `rna` and peaks in the `atac` slot."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Python\n",
    "# method to extract the top features\n",
    "def top_mowgli(dim, n, H_mowgli):\n",
    "    \"\"\"\n",
    "    Get the top n peaks for a given dimension.\n",
    "    \"\"\"\n",
    "    H_scaled = H_mowgli / H_mowgli.sum(axis=1, keepdims=True)\n",
    "    return H_scaled[:, dim].argsort()[::-1][:n]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Extract the top peaks"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Python\n",
    "# Peaks\n",
    "\n",
    "# we set the number of peaks to look at\n",
    "n_peaks = 100\n",
    "\n",
    "# Get the genes or peak dictionaries\n",
    "H_mowgli_atac = mdata[\"atac\"].uns[\"H_OT\"]\n",
    "\n",
    "# actual features extraction\n",
    "mdata[\"atac\"].var_names = mdata[\"atac\"].var_names.str.replace(\"atac:\", \"\")\n",
    "top_in_mowgli = mdata[\"atac\"].var.copy()\n",
    "\n",
    "# Fill the Mowgli top peaks.\n",
    "for dim in range(H_mowgli_atac.shape[1]):\n",
    "    col_name = f\"top_in_dim_{dim}\"\n",
    "    idx = top_in_mowgli.index[top_mowgli(dim, n_peaks, H_mowgli_atac)]\n",
    "    top_in_mowgli[col_name] = False\n",
    "    top_in_mowgli.loc[idx, col_name] = True\n",
    "\n",
    "# Save Mowgli's top peaks.\n",
    "top_in_mowgli.to_csv(\"top_peaks_in_mowgli.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Extract the top genes (for other expression-space based enrchments)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Python\n",
    "# we set the number of peaks to look at\n",
    "n_genes = 100\n",
    "\n",
    "H_mowgli_rna = mdata[\"rna\"].uns[\"H_OT\"]\n",
    "\n",
    "# select the top genes to probe using only the highly variable genes (our universe)\n",
    "top_in_mowgli = (\n",
    "    mdata[\"rna\"].var.loc[mdata[\"rna\"].var[\"highly_variable\"] == True, :].copy()\n",
    ")  # the var coordinates\n",
    "\n",
    "for dim in range(H_mowgli_rna.shape[1]):\n",
    "    print(dim)\n",
    "    # name of the column iun the var object that will be used to extract the top peaks for each gfiven dimenssion\n",
    "    col_name = f\"top_in_dim_{dim}\"\n",
    "    idx = top_in_mowgli.index[\n",
    "        top_mowgli(dim, n_genes, H_mowgli_rna)\n",
    "    ]  # indices of the top features for that given dimensions. will be used for localizing the peaks afterwasrds\n",
    "    top_in_mowgli[col_name] = False  # set all value for that dimesions to False\n",
    "    top_in_mowgli.loc[idx, col_name] = True  # set to True only the peaks that are\n",
    "\n",
    "# Save Mowgli's top genes.\n",
    "top_in_mowgli.to_csv(\n",
    "    os.path.join(top_feats_dir, f\"top_genes_in mowgli.csv\"),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Motif enrichment \n",
    "\n",
    "This code was used in the [original publication](https://doi.org/10.1038/s41467-023-43019-2) to perform motif enrichment analysis from chromatin accessibility (Figure 5-C).  This notebook is a summarisation of the code that is stored in our [Mowgli reproducibility](https://github.com/cantinilab/mowgli_reproducibility) repository."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "# R\n",
    "# Imports.\n",
    "library(GenomicRanges)\n",
    "library(motifmatchr)\n",
    "library(chromVAR)\n",
    "library(TFBSTools)\n",
    "library(JASPAR2022)\n",
    "library(Signac)\n",
    "library(BSgenome.Hsapiens.UCSC.hg38)\n",
    "library(chromVARmotifs)\n",
    "library(MuData)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "# R\n",
    "# Read atac file.\n",
    "in_atac <- \"top_peaks_in_mowgli.csv\" # nolint\n",
    "peaks_csv <- read.csv(in_atac, row.names = 2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "# R\n",
    "# Optional: Remove non-canonical chromosomes.\n",
    "peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"GL000194.1\",]\n",
    "peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"GL000205.2\",]\n",
    "peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"GL000205.2\",]\n",
    "peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"GL000219.1\",]\n",
    "peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"GL000219.1\",]\n",
    "peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"KI270721.1\",]\n",
    "peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"KI270726.1\",]\n",
    "peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"KI270726.1\",]\n",
    "peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"KI270713.1\",]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "# R\n",
    "# Convert the peaks to GRanges.\n",
    "chromosomes <- peaks_csv[\"Chromosome\"][, 1]\n",
    "ranges <- IRanges::IRanges(\n",
    "    start = peaks_csv[\"Start\"][, 1],\n",
    "    end = peaks_csv[\"End\"][, 1]\n",
    ")\n",
    "peaks <- GenomicRanges::GRanges(seqnames = chromosomes, ranges = ranges)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "# R\n",
    "# Get JASPAR motifs.\n",
    "opts <- list()\n",
    "opts[\"species\"] <- \"Homo sapiens\"\n",
    "opts[\"collection\"] <- \"CORE\"\n",
    "motifs <- TFBSTools::getMatrixSet(JASPAR2022::JASPAR2022, opts)\n",
    "motifs_pwm <- TFBSTools::toPWM(motifs)\n",
    "\n",
    "# Get cisBP motifs.\n",
    "data(\"human_pwms_v2\")\n",
    "\n",
    "# Fuse JASPAR and cisBP motifs.\n",
    "for (name in names(motifs_pwm)) {\n",
    "    human_pwms_v2[name] <- motifs_pwm[name]\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "# R\n",
    "# Create a 'dummy' Signac object from the peaks.\n",
    "# Actually giving peaks_csv is nonsense.\n",
    "# But we only care about the rownames so it's fine.\n",
    "assay <- Signac::CreateChromatinAssay(\n",
    "    peaks_csv,\n",
    "    ranges = peaks,\n",
    "    sep = c(\":\", \"-\")\n",
    ")\n",
    "\n",
    "# Create statistics about peaks.\n",
    "assay <- Signac::RegionStats(\n",
    "    object = assay,\n",
    "    genome = BSgenome.Hsapiens.UCSC.hg38\n",
    ")\n",
    "\n",
    "# Add the downloaded motif PWM annotation.\n",
    "assay <- Signac::AddMotifs(\n",
    "    object = assay,\n",
    "    genome = BSgenome.Hsapiens.UCSC.hg38,\n",
    "    pfm = human_pwms_v2\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "# R\n",
    "# Define where to save the motif enrichment outputs.\n",
    "out_motif <- \"motifs_\"\n",
    "\n",
    "# Get all top peaks.\n",
    "background <- c()\n",
    "for (dim in 0:49) {\n",
    "\n",
    "    # Get the top peaks for that dimension.\n",
    "    features <- rownames(assay)[peaks_csv[paste0(\"top_in_dim_\", dim)] == \"True\"]\n",
    "\n",
    "    background <- c(background, features)\n",
    "}\n",
    "\n",
    "# Iterate over Mowgli's dimensions.\n",
    "for (dim in 0:49) {\n",
    "\n",
    "    # Get the top peaks for that dimension.\n",
    "    features <- rownames(assay)[peaks_csv[paste0(\"top_in_dim_\", dim)] == \"True\"]\n",
    "\n",
    "    # Do motif enrichment analysis.\n",
    "    enriched_motifs <- Signac::FindMotifs(\n",
    "        object = assay,\n",
    "        features = features,\n",
    "        background = background\n",
    "    )\n",
    "\n",
    "    # Save the enrichment.\n",
    "    write.csv(enriched_motifs, paste0(out_motif, dim, \".csv\"))\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "source": [
    "## TF Enrichment using top genes in mowgli dimensions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "source": [
    "We perform here a standard TF enrichment using the top features identifuied in the RNA space for each dimension of Mowgli.\n",
    "\n",
    "In this case example, we made use of the [Regulatory Circuits](https://doi.org/10.1038/nmeth.3799) database ([link](http://www2.unil.ch/cbg/regulatorycircuits/FANTOM5_individual_networks.tar)), but we recommend the users to choose the most appropriate TF->Target database according to his domain and prior biological information.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "# Reload libraries\n",
    "\n",
    "library(stats)\n",
    "library(dplyr)\n",
    "library(stringr)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "# R \n",
    "\n",
    "# reload GRN and make a TF-> Target list\n",
    "\n",
    "# network of epithelial cells\n",
    "grn.path <- \"Regulatory_circuits_mammary_epithelial_cell.txt\"\n",
    "grn <- read.table(grn.path, sep=\"\\t\", header = F)\n",
    "colnames(grn) <- c(\"TF\", \"Target\", \"score\")\n",
    "\n",
    "# make a TF -> Target list\n",
    "tf_list <- split(grn$Target, grn$TF)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "# R\n",
    "\n",
    "# reload top features for RNA and reparse it\n",
    "top_feats.path <- \"top_genes_in_mowgli.csv\"\n",
    "top_feats <- read.table(top_feats.path, sep = \",\", header = T)\n",
    "# set row names to index\n",
    "row.names(top_feats) <- top_feats$hgnc_symbol\n",
    "\n",
    "cols_to_keep <- c(\"highly_variable\", grep(\"top_in_dim\", names(top_feats), value = TRUE))\n",
    "\n",
    "top_feats.filtered <- top_feats %>%\n",
    "  select(all_of(cols_to_keep))\n",
    "\n",
    "top_feats.filtered <- top_feats.filtered %>%\n",
    "  mutate(\n",
    "    `highly_variable` = as.logical(ifelse(`highly_variable` == \"True\", TRUE, FALSE)),\n",
    "    across(starts_with(\"top_in_dim\"), ~ as.logical(ifelse(. == \"True\", TRUE, FALSE)))\n",
    "  )\n",
    "\n",
    "# define the universe -> in this case, a list of highly variable genes in the RNA slot\n",
    "universe <- readLines(\"highly_variable_genes.txt\")\n",
    "\n",
    "universe.len <- length(universe)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "source": [
    "In brief:\n",
    "- we loop through each dimension and we select for each dimension the top features\n",
    "- we loop through each TF (using a groupby) and we identify the top sets of features\n",
    "- we make a hypergeometric test \n",
    "- we calculate an enrichment score enrichment\n",
    "- we correct the pvalue using Benjamini-hochberg correction\n",
    "- we write the results to a file (only for significant TFs enriched)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "r"
    }
   },
   "outputs": [],
   "source": [
    "# R\n",
    "\n",
    "# directory that will store the results\n",
    "res.dir <- \"tf_enrichment\"\n",
    "\n",
    "# colnames of the dataframe storing the results\n",
    "res.colnames <- c(\"TF\", \"number_of_targets\", \"enrichment_score\", \"p.val\", \"p.adjust\")\n",
    "\n",
    "# loop through all the top mowgli dimensions\n",
    "for (dim in names(top_feats.filtered)[grepl(\"^top_in\", names(top_feats.filtered))]) {\n",
    "    dim_number <- str_extract(dim, \"\\\\d+$\")\n",
    "    print(paste(\"enriching:\", dim_number))\n",
    "    \n",
    "    # select the top features for that dimensiob\n",
    "    top_genes <- rownames(top_feats.filtered[top_feats.filtered[[dim]] == TRUE, ])\n",
    "    top_genes.len <- length(top_genes) # should always be 100\n",
    "    ratioDim <- top_genes.len / universe.len # number of genes in the universe, shoukd always be the same numbe, useful for the enrichment\n",
    "\n",
    "    # open the output file\n",
    "    output_file_name <- file.path(res.dir, paste0(\"enriched_TFs_dim\", dim_number, \".tsv\"))\n",
    "    res.df <- data.frame(matrix(ncol= length(res.colnames), nrow = 0))\n",
    "    colnames(res.df) <- res.colnames\n",
    "   \n",
    "    # define the list to store files \n",
    "    p_values <- numeric()\n",
    "    tfs <- character()\n",
    "    n_targets <- numeric()\n",
    "    enrichment_scores <- numeric()\n",
    "\n",
    "    # loop through all the TFs and perform the enrichment\n",
    "    for (tf in names(tf_list)){\n",
    "        targets <- tf_list[[tf]]\n",
    "        \n",
    "        x <- length(intersect(targets, top_genes)) # white balls, i.e. how many genes in top dim are in the TF targets\n",
    "        m <- length(intersect(universe, targets)) # number of white balls in the urn, i.e., how many targets are in the universe\n",
    "        n <- universe.len - length(intersect(targets, universe)) # number of black balls in the urn, i.e. how many genes in the universe are NOT in the targets\n",
    "        k <- top_genes.len # the size of the balls drawn, always 100 (the number of top genes in the dimension)\n",
    "        \n",
    "        p_value <- phyper(x, m, n, k, lower.tail = FALSE) #select as significantonly the over enriched\n",
    "        n_targets_expressed <- length(intersect(targets, rownames(universe)))\n",
    "        n_targets_feats <- x/universe.len\n",
    "        ratioTargets <- ifelse(n_targets_expressed == 0, 0, x / n_targets_expressed)\n",
    "        enrichment_score <- 1/ (ratioTargets / ratioDim)\n",
    "        \n",
    "        # Store results for FDR adjustment\n",
    "        p_values <- c(p_values, p_value)\n",
    "        tfs <- c(tfs, tf)\n",
    "        n_targets <- c(n_targets, x)\n",
    "        enrichment_scores <- c(enrichment_scores, enrichment_score)\n",
    "        }\n",
    "\n",
    "    # Adjust p-values using Benjamini-Hochberg correction\n",
    "    adjusted_p_values <- p.adjust(p_values, method = \"BH\")\n",
    "\n",
    "    # Combine results into a dataframe\n",
    "    results <- data.frame(TF = tfs, number_of_targets = n_targets, enrichment_score = enrichment_scores, p_value = p_values, adjusted_p_value = adjusted_p_values)\n",
    "    \n",
    "    # Filter results for significant adjusted p-values\n",
    "    significant_results <- results[results$adjusted_p_value <= 0.05, ]\n",
    "    \n",
    "    # Save significant results to file\n",
    "    output_file_name <- file.path(res.dir, paste0(\"enriched_TFs_dim\", dim_number, \".tsv\"))\n",
    "    write.table(significant_results, file = output_file_name, sep = \"\\t\", row.names = FALSE, col.names = TRUE, quote = FALSE)\n",
    "}"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.8.5 ('base')",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  },
  "nbsphinx": {
   "execute": "never"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}